Evaluation Metrics
1. Text-to-Video Retrieval Recall
The text-to-video retrieval task aims to establish semantic associations between text and video, enabling bidirectional retrieval. To evaluate a retrieval system comprehensively, a set of recall-based metrics is typically used: Top-K recall (R@K) in both the video-to-text (V2T) and text-to-video (T2V) directions, together with Mean Recall, which summarizes overall retrieval performance. These metrics reflect the model's behavior at different levels of retrieval difficulty: R@1 indicates precise matching ability, while R@5 and R@10 indicate fault tolerance. They also let researchers probe a model's strengths and weaknesses from multiple perspectives, such as whether it still returns correct results for targets that are semantically similar but subtly different, or whether it still ranks the corresponding video or text within the top K results when a description is vague or incomplete, thereby demonstrating the model's robustness and generalization ability.
1.1 Recall (Recall@K):
Recall@K measures the fraction of queries for which the ground-truth item appears among the top K retrieved results:

R@K = (number of queries whose correct match ranks within the top K) / (total number of queries) × 100%

Here, K is typically set to 1, 5, or 10.
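This definition can be sketched in a few lines of Python; the function name and the toy rank list below are illustrative, not taken from the original codebase.

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth item ranks within the top k.

    ranks: 1-based rank of the correct match for each query.
    """
    return sum(r <= k for r in ranks) / len(ranks)

# Toy example: five queries whose correct matches rank 1, 3, 7, 2, and 12
ranks = [1, 3, 7, 2, 12]
print(recall_at_k(ranks, 1))   # 0.2
print(recall_at_k(ranks, 5))   # 0.6
print(recall_at_k(ranks, 10))  # 0.8
```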
1.2 Video-to-Text Recall (V2T R@K):
V2T R@K applies Recall@K in the video-to-text direction: for each video query, it checks whether a matching text description appears among the top K retrieved results.
1.3 Text-to-Video Recall (T2V R@K):
T2V R@K applies Recall@K in the text-to-video direction: for each text query, it checks whether the matching video appears among the top K retrieved results.
1.4 Mean Recall:
Mean Recall aggregates the Top-K recall values from both the V2T and T2V directions into a single comprehensive metric, typically computed as the average of the six R@K scores (V2T and T2V at K = 1, 5, 10). It thereby summarizes the retrieval system's average performance across directions and difficulty levels. A higher Mean Recall indicates that the model performs well across all retrieval dimensions, not only in one direction but across diverse matching scenarios.
These metrics, through multi‑granularity evaluation (K=1,5,10), comprehensively reflect the retrieval system’s precise matching ability and fault tolerance, where:
- R@1 measures strict matching accuracy and reflects whether the system can directly find the correct result in the first position
- R@5 / R@10 reflect the system’s robustness when the retrieval range is relaxed, indicating whether the model can still cover the correct match under less strict conditions
- Mean Recall provides a single metric for overall performance, allowing researchers to quickly compare the comprehensive performance of different models or configurations

These metrics are widely applied when evaluating models on mainstream video retrieval benchmarks such as MSR-VTT and ActivityNet, and they are well established and highly reliable.
1.5 Code
Retrieval metric calculation code: retrieval_evaluator
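The referenced retrieval_evaluator script is not reproduced here. As a minimal sketch of what such an evaluator typically computes, the pure-Python function below (an assumption, with illustrative names) derives V2T/T2V R@K and Mean Recall from a square similarity matrix whose ground-truth text-video pairs lie on the diagonal.

```python
def retrieval_metrics(sim):
    """Compute R@1/5/10 in both directions plus Mean Recall.

    sim: square list of lists; sim[i][j] is the similarity between
    text i and video j, with ground-truth pairs on the diagonal (i, i).
    """
    n = len(sim)

    def ranks(rows):
        # 1-based rank of the diagonal (ground-truth) entry per query;
        # ties are broken optimistically (only strictly higher scores count)
        return [1 + sum(v > row[i] for v in row) for i, row in enumerate(rows)]

    t2v = ranks(sim)                                           # text queries over videos
    v2t = ranks([[sim[j][i] for j in range(n)] for i in range(n)])  # video queries over texts

    metrics = {}
    for name, r in (("T2V", t2v), ("V2T", v2t)):
        for k in (1, 5, 10):
            metrics[f"{name} R@{k}"] = 100.0 * sum(x <= k for x in r) / n
    # Mean Recall: average of the six R@K scores computed above
    metrics["Mean Recall"] = sum(metrics.values()) / len(metrics)
    return metrics
```

Note that real evaluators differ in tie handling and may average over multiple ground-truth captions per video; this sketch assumes a single matched pair per index.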